Compiling MPI for Many-Core Systems

Authors

  • G. Bronevetsky
  • A. Friedley
  • T. Hoefler
  • A. Lumsdaine
  • D. Quinlan
Abstract

Processors with multiple (or many) cores and shared memory are becoming ubiquitous across the computing spectrum. MPI, the current de facto programming model for scalable parallel applications, enforces copies between source and target processes and thus cannot fully utilize the shared memory and cache architectures of modern machines. To enable MPI-based programs to more fully exploit features of multi- and many-core architectures, we present a compiler-based transformation that converts MPI processes into threads and fuses message serialization and deserialization loops so that send and receive calls can be replaced by direct memory accesses. Our compiler replaces most of the MPI communication functions with direct load/store accesses, and our runtime provides a threaded MPI implementation to handle the remaining functions. We show the utility of our transformation with two applications: a molecular dynamics code, MiniMD, and a two-dimensional parallel fast Fourier transform (FFT). Our benchmarks show that our loop fusion techniques reduce communication times by up to 43% for MiniMD and by up to 59% for the FFT on modern multi-core systems. Our techniques will enable the automatic transformation of existing MPI codes to take advantage of modern shared memory architectures. In the future, this approach will aid in the transformation of MPI codes to a hybrid communication model that achieves high performance on a wide range of systems, from individual nodes to very large clusters of many-core nodes.
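To make the described transformation concrete, the following hand-written C sketch (an illustration, not the paper's actual compiler output; the names exchange_mpi, exchange_fused, and exchange_barrier are invented) contrasts a conventional pack/exchange/unpack phase with the fused, thread-based form the abstract describes, where a pthread barrier stands in for MPI's message-matching synchronization:

    #include <mpi.h>
    #include <pthread.h>
    #include <stdlib.h>

    /* Before: each rank serializes a strided slice into a contiguous
     * buffer, exchanges it with a peer, and MPI copies the incoming
     * message into the receiver's ghost buffer. */
    void exchange_mpi(const double *local, double *ghost,
                      int n, int stride, int peer)
    {
        double *pack = malloc(n * sizeof *pack);
        for (int i = 0; i < n; i++)              /* serialization loop */
            pack[i] = local[i * stride];
        MPI_Sendrecv(pack, n, MPI_DOUBLE, peer, 0,
                     ghost, n, MPI_DOUBLE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        free(pack);
    }

    extern pthread_barrier_t exchange_barrier;   /* hypothetical runtime object */

    /* After: with MPI processes turned into threads of one process, the
     * peer's ghost buffer is directly addressable, so the serialization
     * and deserialization loops fuse into one loop of plain stores and
     * the send/receive pair disappears. */
    void exchange_fused(const double *local, double *peer_ghost,
                        int n, int stride)
    {
        for (int i = 0; i < n; i++)              /* fused ser/deser loop */
            peer_ghost[i] = local[i * stride];
        pthread_barrier_wait(&exchange_barrier); /* replaces message matching */
    }

The fused version performs one copy instead of three (pack, internal transfer, unpack), which is where the reported communication-time reductions come from.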


Similar Articles

MPI- and CUDA-implementations of modal finite difference method for P-SV wave propagation modeling

Among different discretization approaches, the Finite Difference Method (FDM) is widely used for acoustic and elastic full-waveform modeling. An inevitable deficit of the technique, however, is its severe demand for computational resources. A promising solution is parallelization, in which the problem is broken into several segments and the calculations are distributed over different processors. ...


Using Processor Partitioning to Evaluate the Performance of MPI, OpenMP and Hybrid Parallel Applications on Dual- and Quad-core Cray XT4 Systems

Chip multiprocessors (CMPs) are widely used for high performance computing. While this presents significant new opportunities, such as high on-chip inter-core bandwidth and low inter-core latency, it also presents new challenges in the form of inter-core resource conflict and contention. A challenge to be addressed is how well current parallel programming paradigms, such as MPI, OpenMP, and hybrid ...


Efficient parallelization of the genetic algorithm solution of traveling salesman problem on multi-core and many-core systems

Efficient parallelization of genetic algorithms (GAs) on state-of-the-art multi-threading or many-threading platforms is a challenge due to the difficulty of scheduling hardware resources given the concurrency of threads. In this paper, to resolve the problem, a novel method is proposed which parallelizes the GA by designing three concurrent kernels, each of which runs some depe...


Deadlock Detection in Basic Models of MPI Synchronization Communication Programs

LIAO Ming-xue, FAN Zhi-hua (Institute of Software, Chinese Academy of Sciences, Beijing 100080, China). A model of MPI synchronization communication programs is presented and its three basic simplified models are also defined. A series of theorems and methods for deciding whether deadlocks will occur a...
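As a concrete illustration (a minimal sketch, not taken from the cited paper), the classic head-to-head pattern below is the kind of program such deadlock models classify: with synchronous sends, both ranks block in MPI_Ssend waiting for a matching receive that is never posted:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0, tmp;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = 1 - rank;                     /* assumes exactly 2 ranks */

        MPI_Ssend(&buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD); /* blocks */
        MPI_Recv(&tmp, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);             /* never reached: deadlock */

        MPI_Finalize();
        return 0;
    }

Reordering the calls on one rank (so one side receives first) breaks the wait cycle and removes the deadlock.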


Natively Supporting True One-sided Communication in MPI on Multi-core Systems with InfiniBand

As high-end computing systems continue to grow in scale, the performance that applications can achieve on such large-scale systems depends heavily on their ability to avoid explicitly synchronized communication with other processes in the system. Accordingly, several modern and legacy parallel programming models (such as MPI, UPC, and Global Arrays) have provided many programming constructs that en...
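For readers unfamiliar with such constructs, the minimal MPI one-sided sketch below (illustrative, not from the cited paper; assumes at least two ranks) shows rank 0 writing directly into rank 1's memory window with MPI_Put; fences delimit the access epoch, and the target posts no matching receive:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double window_buf[4] = {0};              /* exposed target memory */
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Win_create(window_buf, sizeof window_buf, sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                   /* open access epoch */
        if (rank == 0) {
            double data[4] = {1, 2, 3, 4};
            MPI_Put(data, 4, MPI_DOUBLE, 1 /* target */, 0 /* disp */,
                    4, MPI_DOUBLE, win);         /* direct remote write */
        }
        MPI_Win_fence(0, win);                   /* complete the epoch */

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }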



Journal:

Volume:   Issue:

Pages:

Publication date: 2013